WHAT IS CLAIMED IS: 

1. A method of creating an audio-centric audio-visual summary of a video program, said
video program having an audio track and an image track, said method comprising:

selecting a length of time L_sum of said audio-visual summary;

examining said audio track and image track;

identifying one or more audio segments from said audio track based on one or more
predetermined audio, image, speech, and text characteristics which relate to desired content
of said audio-visual summary, wherein said identifying is performed in accordance with a
machine learning method which relies on previously-generated experience-based learning
data to provide, for each of said audio segments in said video program, a probability that a
given audio segment is suitable for inclusion in said audio-visual summary;

adding said audio segments to said audio-visual summary;

performing said identifying and adding in descending order of said probability until
the length of time L_sum is reached; and

selecting only one or more image segments corresponding to the one or more
identified audio segments, so as to yield a high degree of synchronization between said one or
more audio segments and said one or more image segments.
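
The selection step recited above (adding segments in descending order of probability until the length L_sum is reached) can be illustrated with a minimal Python sketch. It is not taken from the specification: the `Segment` type, the `select_segments` function, and the rule of skipping a segment that would overshoot L_sum are all assumptions made for illustration, and the probabilities are taken as given rather than computed by any particular learning method.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: float        # starting time code in seconds (hypothetical field)
    length: float       # duration in seconds (hypothetical field)
    probability: float  # suitability score, assumed to come from a trained classifier


def select_segments(segments, l_sum):
    """Pick segments in descending probability until l_sum seconds are filled."""
    chosen, total = [], 0.0
    for seg in sorted(segments, key=lambda s: s.probability, reverse=True):
        if total >= l_sum:
            break
        if total + seg.length > l_sum:
            continue  # assumption: skip a segment that would overshoot l_sum
        chosen.append(seg)
        total += seg.length
    # Return the chosen segments in program order for playback.
    return sorted(chosen, key=lambda s: s.start)


segments = [
    Segment(0.0, 10.0, 0.2),
    Segment(10.0, 5.0, 0.9),
    Segment(15.0, 8.0, 0.7),
    Segment(23.0, 4.0, 0.5),
]
summary = select_segments(segments, l_sum=15.0)
```

In an audio-centric summary the image segments would then be taken from the same time spans as the chosen audio segments, which is what keeps the two tracks synchronized.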

2. A method as claimed in claim 1, wherein said identifying further comprises detecting
audio segments comprising non-speech sounds; classifying said non-speech sounds according
to contents; and, for each of said non-speech sounds, outputting a starting time code, length,
and category.

3. A method as claimed in claim 2, wherein, when said audio segments comprise speech,
said identifying comprises performing speech recognition on said audio segments to generate
speech transcripts, and outputting a starting time code and length for each of said speech
transcripts.

4. A method as claimed in claim 3, wherein, when there is closed captioning present, 
said method further comprises aligning the closed captioning and the speech transcripts. 

5. A method as claimed in claim 4, wherein said identifying further comprises 
generating speech units either based on said aligning, if said closed captioning is present, or 
based on said speech transcripts, if said closed captioning is not present, and creating a 
feature vector for each of said speech units. 

6. A method as claimed in claim 5, further comprising computing an importance rank for 
each of said speech units. 

7. A method as claimed in claim 6, further comprising receiving said speech units and 
determining identities of one or more speakers. 

8. A method as claimed in claim 1, wherein said identifying further comprises 
segmenting said image track into individual image segments. 

9. A method as claimed in claim 8, further comprising extracting image features and 
forming an image feature vector for each of said image segments. 

10. A method as claimed in claim 9, further comprising determining identities of one or 
more faces for each of said image segments. 

11. A method as claimed in claim 1, wherein said probability is computed in accordance 
with a method selected from the group consisting of a Naive Bayes method, a decision tree 
method, a neural network method, and a maximum entropy method.

12. A method of creating an image-centric audio-visual summary of a video program,
said video program having an audio track and an image track, said method comprising:

selecting a length of time L_sum of said audio-visual summary;

examining said image track and audio track of said video program;

identifying one or more image segments from said image track based on one or more
predetermined image, audio, speech, and text characteristics which relate to desired content
of said audio-visual summary, wherein said identifying is performed in accordance with a
machine learning method which relies on previously-generated experience-based learning
data to provide, for each of said image segments in said video program, a probability that a
given image segment is suitable for inclusion in said audio-visual summary;

adding said one or more image segments to said audio-visual summary;

performing said identifying and adding in descending order of said probability until
the length of time L_sum is reached; and

selecting only one or more audio segments corresponding to the one or more
identified image segments, so as to yield a high degree of synchronization between said one
or more image segments and said one or more audio segments.

13. A method as claimed in claim 12, wherein said identifying comprises segmenting said
image track into individual image segments.

14. A method as claimed in claim 13, further comprising extracting image features and
forming an image feature vector for each of said image segments.

15. A method as claimed in claim 14, further comprising determining identities of one or
more faces for each of said image segments.

16. A method as claimed in claim 12, further comprising selecting a minimum playback
time L_min for each of said image segments in said audio-visual summary.

17. A method as claimed in claim 16, wherein L_min is sufficiently small relative to
L_sum such that a relatively large number of audio segments and image segments are provided
in said audio-visual summary, to provide a breadth-oriented audio-visual summary.

18. A method as claimed in claim 16, wherein L_min is sufficiently large relative to
L_sum such that a relatively small number of audio segments and image segments are provided
in said audio-visual summary, to provide a depth-oriented audio-visual summary.
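
The trade-off between a breadth-oriented and a depth-oriented summary follows from simple arithmetic: if every image segment must play for at least the minimum playback time, the summary length bounds how many segments can appear. The Python sketch below illustrates this under that assumption only; the function name `max_image_segments` is hypothetical and does not appear in the claims.

```python
def max_image_segments(l_sum, l_min):
    """Upper bound on image segments when each plays for at least l_min seconds."""
    if l_min <= 0:
        raise ValueError("l_min must be positive")
    return int(l_sum // l_min)


breadth = max_image_segments(60, 2)   # short minimum playback time, many segments
depth = max_image_segments(60, 10)    # long minimum playback time, few segments
```

With a 60-second summary, a 2-second minimum allows up to 30 segments, while a 10-second minimum allows at most 6.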

19. A method as claimed in claim 12, wherein said identifying further comprises 
detecting audio segments comprising non-speech sounds; classifying said non-speech sounds 
according to contents; and, for each of said non-speech sounds, outputting a starting time 
code, length, and category. 

20. A method as claimed in claim 19, wherein, when said audio segments comprise 
speech, said identifying further comprises performing speech recognition on said audio 
segments to generate speech transcripts, and outputting a starting time code and length for 
each of said speech transcripts. 

21. A method as claimed in claim 20, wherein, when there is closed captioning present, 
said method further comprises aligning the closed captioning and the speech transcripts. 

22. A method as claimed in claim 21, wherein said identifying further comprises
generating speech units either based on said aligning, if said closed captioning is present, or
based on said speech transcripts, if said closed captioning is not present, and creating a
feature vector for each of said speech units.

23. A method as claimed in claim 22, further comprising computing an importance rank 
for each of said speech units. 

24. A method as claimed in claim 23, further comprising receiving said speech units and 
determining identities of one or more speakers. 

25. A method as claimed in claim 12, wherein said probability is computed in accordance 
with a method selected from the group consisting of a Naive Bayes method, a decision tree 
method, a neural network method, and a maximum entropy method. 

26. A method of creating an integrated audio-visual summary of a video program, said
video program having an audio track and a video track, said method comprising:

selecting a length of time L_sum of said audio-visual summary;

selecting a minimum playback time L_min for each of said image segments to be
included in the audio-visual summary;

creating an audio summary by selecting one or more desired audio segments until the
audio-visual summary length L_sum is reached, said selecting being determined in accordance
with a machine learning method which relies on previously-generated experience-based
learning data to provide, for each of said audio segments in said video program, a probability
that a given audio segment is suitable for inclusion in said audio-visual summary;

computing, for each of said image segments, a probability that a given image segment
is suitable for inclusion in said audio-visual summary in accordance with said machine
learning method;

for each of said audio segments that are selected, examining a corresponding image
segment to see whether a resulting audio segment/image segment pair meets a predefined
alignment requirement;

if the resulting audio segment/image segment pair meets the predefined alignment
requirement, aligning the audio segment and the image segment in the pair from their
respective beginnings for said minimum playback time L_min to define a first alignment point;

repeating said examining and aligning to identify all of said alignment points;

dividing said length of said audio-visual summary into a plurality of partitions, each
of said partitions having a time period

either starting from a beginning of said audio-visual summary and ending at
the first alignment point; or

starting from an end of the image segment at one alignment point, and ending
at a next alignment point; or

starting from an end of the image segment at a last alignment point and ending
at the end of said audio-visual summary; and

for each of said partitions, adding further image segments in accordance with the
following:

identifying a set of image segments that fall into the time period of that
partition;

determining a number of image segments that can be inserted into said
partition;

determining a length of the identified image segments to be inserted;

selecting said number of the identified image segments in descending order of
said probability that a given image segment is suitable for insertion in said
audio-visual summary; and

from each of the selected image segments, collecting a section from its
respective beginning for said time length and adding all the collected sections in
ascending time order into said partition.
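
The partitioning and filling steps recited above can be sketched in Python as follows. This is an illustration under stated assumptions, not the claimed method itself: alignment points are assumed to be times on the summary timeline at which an aligned image segment begins, each aligned image segment is assumed to occupy exactly `l_min` seconds, each inserted section is likewise assumed to last `l_min` seconds, and the names `partitions` and `fill_partition` are hypothetical.

```python
def partitions(alignment_points, l_min, l_sum):
    """Return the (start, end) periods left open between aligned image segments."""
    parts, cursor = [], 0.0
    for point in sorted(alignment_points):
        if point > cursor:
            parts.append((cursor, point))
        # Assumption: the aligned image segment plays for l_min seconds.
        cursor = point + l_min
    if cursor < l_sum:
        parts.append((cursor, l_sum))
    return parts


def fill_partition(part, candidates, l_min):
    """Choose image segments for one partition.

    candidates: (start_time, probability) pairs for the image segments that
    fall into the partition's time period.
    """
    start, end = part
    slots = int((end - start) // l_min)  # how many segments fit
    best = sorted(candidates, key=lambda c: c[1], reverse=True)[:slots]
    return sorted(best)  # ascending time order


parts = partitions([5.0, 20.0], l_min=3.0, l_sum=30.0)
picked = fill_partition(
    (8.0, 20.0),
    [(9.0, 0.3), (12.0, 0.9), (15.0, 0.6), (18.0, 0.8), (19.0, 0.1)],
    l_min=5.0,
)
```

Selecting by probability first and sorting by time afterwards keeps the highest-scoring image segments while preserving their original order within the partition.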

27. A method as claimed in claim 26, wherein said identifying further comprises
detecting audio segments comprising non-speech sounds; classifying said non-speech sounds
according to contents; and, for each of said non-speech sounds, outputting a starting time
code, length, and category.

28. A method as claimed in claim 27, wherein, when said audio segments comprise
speech, said identifying further comprises performing speech recognition on said audio
segments to generate speech transcripts, and outputting a starting time code and length for
each of said speech transcripts.

29. A method as claimed in claim 28, wherein, when there is closed captioning present,
said method further comprises aligning the closed captioning and the speech transcripts.

30. A method as claimed in claim 29, further comprising generating speech units either
based on said aligning, if said closed captioning is present, or based on said speech
transcripts, if said closed captioning is not present, and creating a feature vector for each of
said speech units.

31. A method as claimed in claim 30, further comprising computing an importance rank
for each of said speech units.

32. A method as claimed in claim 31, further comprising receiving said speech units and
determining identities of one or more speakers.

33. A method as claimed in claim 26, wherein L_min is sufficiently small relative to
L_sum such that a relatively large number of image segments are provided in said audio-visual
summary, to provide a breadth-oriented audio-visual summary.

34. A method as claimed in claim 26, wherein L_min is sufficiently large relative to
L_sum such that a relatively small number of image segments are provided in said audio-visual
summary, to provide a depth-oriented audio-visual summary.

35. A method as claimed in claim 26, wherein said probability that said given audio
segment is suitable for inclusion in said audio-visual summary is computed in accordance
with a method selected from the group consisting of a Naive Bayes method, a decision tree
method, a neural network method, and a maximum entropy method.

36. A method as claimed in claim 26, wherein said probability that said given image
segment is suitable for inclusion in said audio-visual summary is computed in accordance
with a method selected from the group consisting of a Naive Bayes method, a decision tree
method, a neural network method, and a maximum entropy method.

37. A method as claimed in claim 26, wherein said identifying further comprises
segmenting said image track into individual image segments.

38. A method as claimed in claim 37, further comprising extracting image features and
forming an image feature vector for each of said image segments.

39. A method as claimed in claim 38, further comprising determining identities of one or
more faces for each of said image segments.

40. A method of creating an audio-centric audio-visual summary of a video program, said
video program having an audio track and an image track, said method comprising:

selecting a length of time L_sum of said audio-visual summary;

examining said audio track and image track;

identifying one or more audio segments from said audio track based on one or more
predetermined audio, image, speech, and text characteristics which relate to desired content
of said audio-visual summary, wherein said identifying is performed in accordance with a
predetermined set of heuristic rules to provide, for each of said audio segments in said video
program, a ranking so as to determine whether a given audio segment is suitable for inclusion
in said audio-visual summary;

adding said audio segments to said audio-visual summary;

performing said identifying and adding in descending order of said ranking of audio
segments until the length of time L_sum is reached; and

selecting only one or more image segments corresponding to the one or more
identified audio segments, so as to yield a high degree of synchronization between said one or
more audio segments and said one or more image segments.

41. A method as claimed in claim 40, wherein said identifying further comprises
detecting audio segments comprising non-speech sounds; classifying said non-speech sounds
according to contents; and, for each of said non-speech sounds, outputting a starting time
code, length, and category.

42. A method as claimed in claim 41, wherein, when said audio segments comprise
speech, said identifying comprises performing speech recognition on said audio segments to
generate speech transcripts, and outputting a starting time code and length for each of said
speech transcripts.

43. A method as claimed in claim 42, wherein, when there is closed captioning present,
said method further comprises aligning the closed captioning and the speech transcripts.

44. A method as claimed in claim 43, further comprising generating speech units either
based on said aligning, if said closed captioning is present, or based on said speech
transcripts, if said closed captioning is not present, and creating a feature vector for each of
said speech units.

45. A method as claimed in claim 44, further comprising receiving said speech units and
determining identities of one or more speakers.

46. A method as claimed in claim 40, wherein said identifying comprises segmenting said
image track into individual image segments.

47. A method as claimed in claim 46, further comprising extracting image features and
forming an image feature vector for each of said image segments.

48. A method as claimed in claim 47, further comprising determining identities of one or
more faces for each of said image segments.

49. A method as claimed in claim 40, further comprising computing said ranking for each
of said speech units.

50. A method of creating an image-centric audio-visual summary of a video program,
said video program having an audio track and an image track, said method comprising:

selecting a length of time L_sum of said summary;

examining said image track and audio track;

identifying one or more image segments from said image track based on one or more
predetermined image, audio, speech, and text characteristics which relate to desired content
of said audio-visual summary, wherein said identifying is performed in accordance with a
predetermined set of heuristic rules to provide, for each of said image segments in said video
program, a ranking so as to determine whether a given image segment is suitable for
inclusion in said audio-visual summary;

adding said one or more image segments to said audio-visual summary;

performing said identifying and adding in descending order of said ranking until the
length of time L_sum is reached; and

selecting only one or more audio segments corresponding to the one or more
identified image segments, so as to yield a high degree of synchronization between said one
or more image segments and said one or more audio segments.

51. A method as claimed in claim 50, wherein said identifying comprises clustering
image segments of said video program based on predetermined visual similarity and dynamic
characteristics.

52. A method as claimed in claim 51, wherein said identifying comprises segmenting said
image track into individual image segments.

53. A method as claimed in claim 52, further comprising extracting image features and
forming an image feature vector for each of said frame clusters.

54. A method as claimed in claim 53, further comprising determining identities of one or
more faces for each of said frame clusters.

55. A method as claimed in claim 50, wherein said identifying further comprises
detecting audio segments comprising non-speech sounds; classifying said non-speech sounds
according to contents; and, for each of said non-speech sounds, outputting a starting time
code, length, and category.

56. A method as claimed in claim 55, wherein, when said audio segments comprise
speech, said identifying comprises performing speech recognition on said audio segments to
generate speech transcripts, and outputting a starting time code and length for each of said
speech transcripts.

57. A method as claimed in claim 56, wherein, when there is closed captioning present,
said method further comprises aligning the closed captioning and the speech transcripts.

58. A method as claimed in claim 57, further comprising generating speech units either
based on said aligning, if said closed captioning is present, or based on said speech
transcripts, if said closed captioning is not present, and creating a feature vector for each of
said speech units.

59. A method as claimed in claim 58, further comprising computing an importance rank
for each of said speech units.

60. A method as claimed in claim 59, further comprising receiving said speech units and
determining identities of one or more speakers.

61. A method as claimed in claim 50, further comprising selecting a minimum playback
time L_min for each of said image segments in said audio-visual summary.

62. A method as claimed in claim 61, wherein L_min is sufficiently small relative to
L_sum such that a relatively large number of audio segments and image segments are provided
in said audio-visual summary, to provide a breadth-oriented audio-visual summary.

63. A method as claimed in claim 61, wherein L_min is sufficiently large relative to
L_sum such that a relatively small number of audio segments and image segments are provided
in said audio-visual summary, to provide a depth-oriented audio-visual summary.

64. A method of creating an integrated audio-visual summary of a video program, said
video program having an audio track and a video track, said method comprising:

selecting a length L_sum of said audio-visual summary;

selecting a minimum playback time L_min for each of a plurality of image segments to
be included in the audio-visual summary;

creating an audio summary by selecting one or more desired audio segments, said
selecting being determined in accordance with a predetermined set of heuristic rules to
provide, for each of said audio segments in said video program, a ranking to determine
whether a given audio segment is suitable for inclusion in said video summary;

performing said selecting in descending order of said ranking of audio segments until
said audio-visual summary length is reached;

grouping said image segments of said video program into a plurality of frame clusters
based on a visual similarity and a dynamic level of said image segments, wherein each frame
cluster comprises at least one of said image segments, with all the image segments within a
given frame cluster being visually similar to one another;

for each of said audio segments that are selected, examining a corresponding image
segment to see whether a resulting audio segment/image segment pair meets a predefined
alignment requirement;

if the resulting audio segment/image segment pair meets the predefined alignment
requirement, aligning the audio segment and the image segment in the pair from their
respective beginnings for said minimum playback time L_min to define a first alignment point;

repeating said examining and aligning to identify all of said alignment points;

dividing said length of said audio-visual summary into a plurality of partitions, each
of said partitions having a time period

either starting from a beginning of said audio-visual summary and ending at
the first alignment point; or

starting from an end of the image segment at one alignment point, and ending
at a next alignment point; or

starting from an end of the image segment at a last alignment point and ending
at the end of said audio-visual summary; and

dividing each of said partitions into a plurality of time slots, each of said time slots
having a length equal to said minimum playback time L_min;

assigning said frame clusters to fill said time slots of each of said partitions based on
the following:

assigning each frame cluster to only one time slot; and

maintaining a time order of all image segments in the audio-visual summary;

wherein said assigning said frame clusters to fill said time slots is performed in
accordance with a best matching between said frame clusters and said time slots.

65. A method as claimed in claim 64, wherein said best matching is computed by a
method of maximum-bipartite-matching.
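
A maximum bipartite matching between frame clusters and time slots can be computed in several ways; the claims do not name a specific algorithm. The Python sketch below uses the standard augmenting-path approach as one possibility. The input format (`edges[c]` listing the time slots that cluster `c` is allowed to fill) and the function name are assumptions for illustration, and the sketch does not by itself enforce the time-order requirement.

```python
def max_bipartite_matching(edges):
    """Match clusters to time slots so that as many slots as possible are filled.

    edges[c] is the list of slot indices that cluster c may fill
    (a hypothetical compatibility relation). Returns {slot: cluster}.
    """
    slot_owner = {}

    def try_assign(cluster, seen):
        for slot in edges[cluster]:
            if slot in seen:
                continue
            seen.add(slot)
            # Take a free slot, or evict its owner if the owner can move elsewhere.
            if slot not in slot_owner or try_assign(slot_owner[slot], seen):
                slot_owner[slot] = cluster
                return True
        return False

    for cluster in range(len(edges)):
        try_assign(cluster, set())
    return slot_owner


# Cluster 0 may fill slots 0 or 1, cluster 1 only slot 0, cluster 2 slots 1 or 2.
matching = max_bipartite_matching([[0, 1], [0], [1, 2]])
```

Each cluster ends up in at most one slot and each slot holds at most one cluster, which corresponds to the requirement that each frame cluster be assigned to only one time slot.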

66. A method as claimed in claim 65, wherein, if there are more time slots than frame
clusters, identifying those frame clusters which contain more than one image segment, and
assigning image segments from said identified frame clusters to time slots until all of said
time slots are filled, while maintaining said time order of said image segments in said
audio-visual summary.

67. A method as claimed in claim 66, further comprising reviewing said audio-visual
summary to ensure that said time order is maintained, and, if said time order is not
maintained, reordering said image segments that were added in each partition so that said
time order is maintained.

68. A method as claimed in claim 64, wherein said identifying further comprises
detecting audio segments comprising non-speech sounds; classifying said non-speech sounds
according to contents; and, for each of said non-speech sounds, outputting a starting time
code, length, and category.

69. A method as claimed in claim 68, wherein, when said audio segments comprise
speech, said identifying comprises performing speech recognition on said audio segments to
generate speech transcripts, and outputting a starting time code and length for each of said
speech transcripts.

70. A method as claimed in claim 69, wherein, when there is closed captioning present,
said method further comprises aligning the closed captioning and the speech transcripts.

71. A method as claimed in claim 70, further comprising generating speech units either
based on said aligning, if said closed captioning is present, or based on said speech
transcripts, if said closed captioning is not present, and creating a feature vector for each of
said speech units.

72. A method as claimed in claim 71, further comprising computing an importance rank
for each of said speech units.

73. A method as claimed in claim 72, further comprising receiving said speech units and
determining identities of one or more speakers.

74. A method as claimed in claim 64, wherein L_min is sufficiently small relative to
L_sum such that a relatively large number of image segments are provided in said audio-visual
summary, to provide a breadth-oriented audio-visual summary.

75. A method as claimed in claim 64, wherein L_min is sufficiently large relative to
L_sum such that a relatively small number of image segments are provided in said audio-visual
summary, to provide a depth-oriented audio-visual summary.

76. A method as claimed in claim 64, wherein said identifying comprises segmenting said
image track into individual image segments.

77. A method as claimed in claim 76, further comprising extracting image features and
forming an image feature vector for each of said frame clusters.

78. A method as claimed in claim 77, further comprising determining identities of one or
more faces for each of said image segments.